Air Quality Project Proposal

These libraries will be required:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

1 Original Data Visualization in News Media

Air quality in urban environments is a pressing concern that links public health, environmental policy, and urban planning. Our visualization, inspired by the data presented by Visual Capitalist (2022), examines air quality levels across major global cities and offers insight into the state of urban air pollution and its implications. Our project aims to examine the relationship between air quality indices and urbanization patterns, and to surface trends that may correlate with countries' income levels. The visualization covers cities worldwide and provides a comparative analysis that highlights both improvements and deteriorations in air quality. While the original visualization communicates these trends effectively, several enhancements could improve its clarity and depth. Interactive maps, detailed temporal breakdowns, and demographic overlays would allow a more nuanced exploration of how urban development strategies affect air quality, and would help readers see the trends in air quality across global cities.

Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)

2 Critical Assessment of the Original Visualization

The original visualization effectively utilizes red circles to represent PM2.5 concentrations, making it easy to see the relative levels of pollution in each capital city, thereby enhancing both clarity and visual appeal. Each city is clearly labeled, which provides a direct understanding of the represented data. Additionally, the use of color and circle size to indicate levels exceeding the WHO safe limit effectively highlights critical information, drawing attention to areas with severe pollution. The quantitative clarity is also strong, as the circles’ sizes correspond to specific PM2.5 concentration ranges, providing an immediate visual grasp of pollution severity. While the visualization is not interactive, it holds potential for interactivity, which could further enhance user engagement and information depth by allowing for detailed data retrieval. Overall, the original visualization effectively communicates the general trends in PM2.5 pollution across various capital cities. However, there are several shortcomings that we have identified.

  1. Absence of Grid Lines: Without grid lines or a reference scale, it is difficult to read exact values and judge relative magnitudes, which can cause confusion when comparing cities.
  2. Static Year Selection: The visualization is limited to 2022 data. Including multiple years would provide a more dynamic and comprehensive view of trends over time.
  3. No Regional Differentiation: While each city is labeled, there is no clear regional differentiation which could be useful in understanding broader regional trends and patterns.
  4. Static Presentation: The circles are fixed and do not change based on user input. Dynamic elements, such as bubble sizes or color gradients that adjust over time or based on user-selected parameters, could enhance the visual representation.
  5. Lack of Interaction: There are no interactive elements like info-tips or toggle buttons that allow users to explore the data in more depth or switch views between different time periods or concentration ranges.
  6. Data Density: In highly polluted areas (e.g., cities with PM2.5 levels above 50 µg/m³), the circles become dense and can overlap, making it harder to distinguish individual data points.

3 Proposed Improvements

  1. Improved color coding: Using distinct colors and gradients would make the data easier to tell apart (e.g., green for good air quality, yellow for moderate air quality, and red for bad air quality).
  2. Distinction for regional data: Grouping cities by country and countries by region will better represent air quality at the country and regional levels.
  3. Addition of comparison and filtering options: Filters that highlight one or more countries or regions will let users view and compare the data they care about instead of scanning the whole list to find a specific data point.
  4. Interactive bubbles: Hovering over a bubble will show the PM2.5 concentration for the country or region it represents.
  5. Expansion on viewable data: Including options to view data from before 2022 will support trend analysis of air quality for each country or region.
  6. Additional ways to represent the data: Offer other views of the data, such as a heat map, bar chart, or line graph. To avoid overlapping bubbles, show only a few at a time and adjust the scale of the graph.
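
The color-coding idea in item 1 amounts to a binning step, sketched here in base R. The cut points (12 and 35 µg/m³), the toy values, and the helper name `pm25_band()` are illustrative assumptions; they are not thresholds taken from the IQAir report or the WHO dataset.

```r
# Hypothetical helper: map PM2.5 concentrations to a color band.
# The cut points 12 and 35 are placeholder assumptions for illustration only.
pm25_band <- function(pm25, breaks = c(-Inf, 12, 35, Inf),
                      labels = c("green", "yellow", "red")) {
  # right = TRUE: a value equal to a cut point falls in the lower band
  as.character(cut(pm25, breaks = breaks, labels = labels, right = TRUE))
}

# Toy concentrations, not real measurements for any particular city
toy_pm25 <- c(4.2, 12, 20.5, 35.1, 119.77)
bands <- pm25_band(toy_pm25)
print(data.frame(pm25 = toy_pm25, band = bands))
```

The resulting band could then be mapped to a fill color in the plotting step. Missing concentrations stay `NA` rather than being assigned a color.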

4 Data Cleaning

4.1 Data Source Summary

The original visualization was built on data from the IQAir 2022 World Air Quality Report. That data does not appear to be publicly available, so we will instead use a dataset from the World Health Organization (WHO), which reports air quality measurements by city and year. The dataset contains PM2.5, PM10, and NO2 concentrations, along with the temporal coverage of each measurement. Below are the glimpse() and summary() outputs for the data.

airReport <- readxl::read_excel("who_aap_2021_v9_11august2022.xlsx")
glimpse(airReport)
Rows: 32,191
Columns: 15
$ `WHO Region`                             <chr> "Eastern Mediterranean Region…
$ ISO3                                     <chr> "AFG", "ALB", "ALB", "ALB", "…
$ `WHO Country Name`                       <chr> "Afghanistan", "Albania", "Al…
$ `City or Locality`                       <chr> "Kabul", "Durres", "Durres", …
$ `Measurement Year`                       <dbl> 2019, 2015, 2016, 2015, 2016,…
$ `PM2.5 (μg/m3)`                          <dbl> 119.77, NA, 14.32, NA, NA, NA…
$ `PM10 (μg/m3)`                           <dbl> NA, 17.65, 24.56, NA, NA, NA,…
$ `NO2 (μg/m3)`                            <dbl> NA, 26.63, 24.78, 23.96, 26.2…
$ `PM25 temporal coverage (%)`             <dbl> 18, NA, NA, NA, NA, NA, NA, N…
$ `PM10 temporal coverage (%)`             <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ `NO2 temporal coverage (%)`              <dbl> NA, 83.96119, 87.93260, 97.85…
$ Reference                                <chr> "U.S. Department of State, Un…
$ `Number and type of monitoring stations` <chr> "NA", "NA", "NA", "NA", "NA",…
$ `Version of the database`                <dbl> 2022, 2022, 2022, 2022, 2022,…
$ Status                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N…
summary(airReport)
  WHO Region            ISO3           WHO Country Name   City or Locality  
 Length:32191       Length:32191       Length:32191       Length:32191      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Measurement Year PM2.5 (μg/m3)     PM10 (μg/m3)     NO2 (μg/m3)    
 Min.   :2000     Min.   :  0.01   Min.   :  1.04   Min.   :  0.00  
 1st Qu.:2014     1st Qu.: 10.35   1st Qu.: 16.98   1st Qu.: 12.00  
 Median :2016     Median : 16.00   Median : 22.00   Median : 18.80  
 Mean   :2016     Mean   : 22.92   Mean   : 30.53   Mean   : 20.62  
 3rd Qu.:2018     3rd Qu.: 31.00   3rd Qu.: 31.30   3rd Qu.: 27.16  
 Max.   :2021     Max.   :191.90   Max.   :540.00   Max.   :210.68  
                  NA's   :17143    NA's   :11082    NA's   :9991    
 PM25 temporal coverage (%) PM10 temporal coverage (%)
 Min.   :  0.00             Min.   :  2.568           
 1st Qu.: 88.60             1st Qu.: 87.945           
 Median : 97.00             Median : 96.039           
 Mean   : 90.79             Mean   : 90.583           
 3rd Qu.: 99.00             3rd Qu.: 98.938           
 Max.   :100.00             Max.   :100.000           
 NA's   :24916              NA's   :26810             
 NO2 temporal coverage (%)  Reference        
 Min.   :  1.923           Length:32191      
 1st Qu.: 93.208           Class :character  
 Median : 96.370           Mode  :character  
 Mean   : 93.697                             
 3rd Qu.: 98.927                             
 Max.   :100.000                             
 NA's   :12301                               
 Number and type of monitoring stations Version of the database  Status       
 Length:32191                           Min.   :2016            Mode:logical  
 Class :character                       1st Qu.:2022            NA's:32191    
 Mode  :character                       Median :2022                          
                                        Mean   :2022                          
                                        3rd Qu.:2022                          
                                        Max.   :2022                          
                                                                              

4.2 Handling of Missing Values

Based on the summaries above, the pollutant and temporal coverage columns contain many missing values (for example, 17,143 of the 32,191 rows have no PM2.5 value). We need to handle these missing values before proceeding. Two common methods are:

  1. Dropping Missing Values: We can drop rows with missing values if they are not significant in number.
  2. Imputation: We can impute missing values with the mean, median, or mode of the column.

We will impute missing values with the mean of the column using mutate() and across() from the dplyr package.

# Work on a copy so the raw data stays untouched
aap_data_cleaned <- airReport

# Numeric columns in which missing values will be imputed
numeric_columns <- c("PM2.5 (μg/m3)", "PM10 (μg/m3)", "NO2 (μg/m3)", "PM25 temporal coverage (%)", "PM10 temporal coverage (%)", "NO2 temporal coverage (%)")

# Replace each NA with the mean of its own column
aap_data_cleaned <- aap_data_cleaned %>%
  mutate(across(all_of(numeric_columns), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
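
For comparison, the two options listed above can be sketched side by side on a toy data frame. Base R is used so the snippet runs without extra packages; the column name `pm25` and all values are made up for illustration.

```r
# Toy data standing in for the PM2.5 column; values are invented.
toy <- data.frame(
  city = c("A", "B", "C", "D"),
  pm25 = c(10, NA, 30, NA)
)

# Option 1: drop rows where PM2.5 is missing.
dropped <- toy[!is.na(toy$pm25), ]

# Option 2: replace missing PM2.5 values with the column mean.
imputed <- toy
imputed$pm25[is.na(imputed$pm25)] <- mean(toy$pm25, na.rm = TRUE)

print(dropped)
print(imputed)
```

Mean imputation keeps every row and leaves the column mean unchanged, but it shrinks the variance of the column. That trade-off is worth keeping in mind here, since more than half of the PM2.5 values in the WHO file are missing.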

4.3 Normalizing Column Names

We will also normalize the column names for consistency and ease of access by converting them to lowercase and replacing spaces with underscores. Special characters such as parentheses, %, and μ are left in place at this stage, so the affected columns still need to be quoted with backticks.

colnames(aap_data_cleaned) <- tolower(gsub(" ", "_", colnames(aap_data_cleaned)))
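
The one-liner above lowercases the names and replaces spaces. If the remaining special characters should also go, one possible approach is sketched below. The helper `normalize_names()` is hypothetical, and the example names are ASCII stand-ins for the real column names.

```r
# Hypothetical helper: lowercase the names, collapse every run of characters
# outside a-z and 0-9 into a single underscore, then trim stray underscores.
normalize_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x, perl = TRUE)
  gsub("^_+|_+$", "", x, perl = TRUE)
}

# ASCII stand-ins for the real column names
example_names <- c("WHO Region", "PM2.5 (ug/m3)", "PM10 temporal coverage (%)")
cleaned_names <- normalize_names(example_names)
print(cleaned_names)
```

Under this rule any character outside a-z and 0-9, including parentheses, % and the unit symbol, is treated as a separator, so the resulting names no longer need backticks. Packages such as janitor offer a comparable `clean_names()` function.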

4.4 Data Type Conversion

We will convert columns to their appropriate data types. For example, the measurement year is read in as a double, so we convert it to an integer; a full date type is unnecessary because the data only records the year.

aap_data_cleaned$measurement_year <- as.integer(aap_data_cleaned$measurement_year)
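
Because as.integer() silently truncates fractional values, a small guard before the conversion can be useful. This is a minimal sketch on a toy vector; the variable names are illustrative.

```r
# Toy stand-in for the measurement year column (stored as double).
years <- c(2019, 2015, 2016, NA)

# as.integer() truncates silently, so check for fractional values first.
# Missing years are allowed through and stay NA after conversion.
is_whole <- is.na(years) | years == floor(years)
stopifnot(all(is_whole))

years_int <- as.integer(years)
print(years_int)
```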

4.5 Removing Duplicates

We will check for and remove any duplicate rows in the dataset to ensure data integrity.

aap_data_cleaned <- aap_data_cleaned %>% distinct()
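
Note that distinct() only removes rows that are identical in every column. A city and year that appear twice with different readings would remain. The sketch below shows how both kinds of duplicates could be counted in base R; the data frame is a toy example, and whether the WHO file actually contains such rows is not checked here.

```r
# Toy data: the last two rows share a city-year key but differ in pm25.
toy <- data.frame(
  city = c("Kabul", "Durres", "Durres", "Durres"),
  year = c(2019L, 2015L, 2016L, 2016L),
  pm25 = c(119.77, NA, 14.32, 15.00)
)

# Exact duplicates: rows identical in every column (what distinct() removes).
exact_dupes <- sum(duplicated(toy))

# Key duplicates: the same city-year appearing more than once.
key_dupes <- toy[duplicated(toy[, c("city", "year")]), ]

print(exact_dupes)
print(key_dupes)
```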

4.6 Cleaned Data

The first rows of the cleaned data are:

head(aap_data_cleaned)
# A tibble: 6 × 15
  who_region            iso3  who_country_name city_or_locality measurement_year
  <chr>                 <chr> <chr>            <chr>                       <int>
1 Eastern Mediterranea… AFG   Afghanistan      Kabul                        2019
2 European Region       ALB   Albania          Durres                       2015
3 European Region       ALB   Albania          Durres                       2016
4 European Region       ALB   Albania          Elbasan                      2015
5 European Region       ALB   Albania          Elbasan                      2016
6 European Region       ALB   Albania          Elbasan                      2017
# ℹ 10 more variables: `pm2.5_(μg/m3)` <dbl>, `pm10_(μg/m3)` <dbl>,
#   `no2_(μg/m3)` <dbl>, `pm25_temporal_coverage_(%)` <dbl>,
#   `pm10_temporal_coverage_(%)` <dbl>, `no2_temporal_coverage_(%)` <dbl>,
#   reference <chr>, number_and_type_of_monitoring_stations <chr>,
#   version_of_the_database <dbl>, status <lgl>
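
With the region and year columns in place, the regional grouping proposed earlier could be prepared roughly as follows. This is a sketch on toy rows: the column name `pm25` is a simplified stand-in for the real PM2.5 column, the values are illustrative, and the median is just one reasonable choice of summary.

```r
# Toy rows shaped like the cleaned data; pm25 is a simplified column name.
toy <- data.frame(
  who_region = c("European Region", "European Region",
                 "Eastern Mediterranean Region"),
  measurement_year = c(2016L, 2016L, 2019L),
  pm25 = c(14.32, 16.00, 119.77)
)

# Median PM2.5 per region and year.
regional <- aggregate(pm25 ~ who_region + measurement_year,
                      data = toy, FUN = median)
print(regional)
```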

5 Conclusion

The data has now been cleaned and is ready for visualization. We will use ggplot2 to create the visualizations and implement the proposed improvements to enhance their clarity and depth, providing a more interactive and informative experience for users. With these enhancements, we aim to create an engaging and insightful visualization that effectively communicates the trends in air quality across global cities.

6 References

Rao, P. (2024, January 6). Visualized: Air quality and pollution in 50 capital cities. Visual Capitalist.
https://www.visualcapitalist.com/cp/air-quality-in-cities-2022/

Air quality database 2022. (2024, June 20). https://www.who.int/data/gho/data/themes/air-pollution/who-air-quality-database/2022